Results 1 - 20 of 32
1.
Article in English | MEDLINE | ID: mdl-38349824

ABSTRACT

Change captioning aims to describe the semantic change between two similar images. In this process, viewpoint change is the most typical distractor: it leads to pseudo changes in the appearance and position of objects, thereby overwhelming the real change. Moreover, since the visual signal of change appears in a local region with weak features, it is difficult for the model to directly translate the learned change features into a sentence. In this paper, we propose a syntax-calibrated multi-aspect relation transformer to learn effective change features under different scenes, and to build reliable cross-modal alignment between the change features and linguistic words during caption generation. Specifically, a multi-aspect relation learning network is designed to 1) explore the fine-grained changes under irrelevant distractors (e.g., viewpoint change) by embedding the relations of semantics and relative position into the features of each image; 2) learn two view-invariant image representations by strengthening their global contrastive alignment relation, so as to help capture a stable difference representation; and 3) provide the model with prior knowledge about whether and where the semantic change happened by measuring the relation between the representation of the captured difference and the image pair. In this manner, the model can learn effective change features for caption generation. Further, we introduce the syntax knowledge of Part-of-Speech (POS) and devise a POS-based visual switch to calibrate the transformer decoder. The switch dynamically utilizes visual information during the generation of different words based on their POS, which enables the decoder to build reliable cross-modal alignment and generate a high-quality linguistic sentence about the change. Extensive experiments show that the proposed method achieves state-of-the-art performance on three public datasets.
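As a rough illustration of the POS-based visual switch described above, the sketch below gates visual features with a part-of-speech embedding before fusing them with the decoder state. The class name, tag-set size, and exact gating form are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a POS-conditioned visual gate (names and sizes assumed,
# not the paper's exact module).
import torch
import torch.nn as nn

class POSVisualSwitch(nn.Module):
    """Gates visual features according to the POS of the word being generated."""
    def __init__(self, num_pos_tags: int, dim: int):
        super().__init__()
        self.pos_embed = nn.Embedding(num_pos_tags, dim)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, visual_feat, hidden, pos_tag):
        # visual_feat, hidden: (batch, dim); pos_tag: (batch,) integer POS ids
        pos = self.pos_embed(pos_tag)
        g = torch.sigmoid(self.gate(torch.cat([pos, hidden], dim=-1)))  # per-dimension switch
        return g * visual_feat + (1.0 - g) * hidden  # POS decides how much vision to use

switch = POSVisualSwitch(num_pos_tags=17, dim=512)
v, h = torch.randn(4, 512), torch.randn(4, 512)
fused = switch(v, h, torch.randint(0, 17, (4,)))   # (4, 512)
```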

2.
IEEE Trans Image Process ; 33: 1938-1951, 2024.
Article in English | MEDLINE | ID: mdl-38224517

ABSTRACT

Generalized Zero-Shot Learning (GZSL) aims at recognizing images from both seen and unseen classes by constructing correspondences between visual images and semantic embeddings. However, existing methods suffer from a strong bias problem, where unseen images in the target domain tend to be recognized as seen classes from the source domain. To address this issue, we propose a Prototype-augmented Self-supervised Generative Network that integrates self-supervised learning and prototype learning into a feature-generating model for GZSL. The proposed model enjoys several advantages. First, we propose a Self-supervised Learning Module to exploit inter-domain relationships, where we introduce anchors as a bridge between seen and unseen categories. In the shared space, we pull the distribution of the target domain away from the source domain and obtain domain-aware features. To the best of our knowledge, this is the first work to introduce self-supervised learning into GZSL as learning guidance. Second, a Prototype Enhancing Module is proposed to utilize class prototypes to model the target domain distribution reliably and at a finer granularity. In this module, a Prototype Alignment mechanism and a Prototype Dispersion mechanism are combined to guide the generation of better target-class features with intra-class compactness and inter-class separability. Extensive experimental results on five standard benchmarks demonstrate that our model performs favorably against state-of-the-art GZSL methods.
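The alignment and dispersion mechanisms can be illustrated with two generic prototype losses: one pulls generated features toward their class prototype, the other pushes prototypes apart. The formulations and the margin value below are assumptions, not the paper's definitions.

```python
# Illustrative prototype losses (assumed formulations, not the paper's exact ones).
import torch
import torch.nn.functional as F

def prototype_alignment_loss(features, labels, prototypes):
    # features: (N, d); labels: (N,); prototypes: (C, d)
    return F.mse_loss(features, prototypes[labels])

def prototype_dispersion_loss(prototypes, margin: float = 1.0):
    # penalize prototype pairs that are closer than `margin`
    dists = torch.cdist(prototypes, prototypes)              # (C, C)
    off_diag = ~torch.eye(len(prototypes), dtype=torch.bool)
    return F.relu(margin - dists[off_diag]).mean()

protos = torch.randn(10, 64, requires_grad=True)
feats, labels = torch.randn(32, 64), torch.randint(0, 10, (32,))
loss = prototype_alignment_loss(feats, labels, protos) + 0.1 * prototype_dispersion_loss(protos)
```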

3.
IEEE Trans Image Process ; 33: 625-638, 2024.
Article in English | MEDLINE | ID: mdl-38198242

ABSTRACT

How to model the effect of reflection is crucial for the single image reflection removal (SIRR) task. Modern SIRR methods usually simplify the reflection formulation by assuming a linear combination of a transmission layer and a reflection layer. However, the large variations in image content and real-world picture-taking conditions often result in far more complex reflections. In this paper, we introduce a new screen-blur combination based on two important factors, namely the intensity and the blurriness of reflection, to better characterize the reflection formulation in SIRR. Specifically, we present the Screen-blur Reflection Network (SRNet), which executes the screen-blur formulation in its network design and adapts to the complex reflections of real scenes. Technically, SRNet consists of three components: a blended image generator, a reflection estimator, and a reflection removal module. The image generator exploits the screen-blur combination to synthesize the training blended images. The reflection estimator learns the reflection layer and a blur degree that measures the level of blurriness of the reflection. The reflection removal module then uses the blended image, blur degree, and reflection layer to filter out the transmission layer in a cascaded manner. Superior results are reported for three different SIRR methods when their training data are generated according to the screen-blur combination. Moreover, extensive experiments on six datasets quantitatively and qualitatively demonstrate the efficacy of SRNet over state-of-the-art methods.
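A minimal sketch of the synthesis idea, assuming the standard "screen" blend applied to a blurred, intensity-scaled reflection layer; the paper's exact screen-blur formulation and parameter values may differ.

```python
# Synthesize a blended training image: screen blend of transmission with a blurred,
# intensity-scaled reflection. alpha and sigma are illustrative, assumed parameters.
import numpy as np
from scipy.ndimage import gaussian_filter

def screen_blur_blend(transmission, reflection, alpha=0.6, sigma=2.0):
    # transmission, reflection: float arrays in [0, 1], shape (H, W, 3)
    r = alpha * gaussian_filter(reflection, sigma=(sigma, sigma, 0))  # blur + intensity scale
    # screen blend: 1 - (1 - T) * (1 - R); result stays in [0, 1]
    return 1.0 - (1.0 - transmission) * (1.0 - r)

T = np.random.rand(64, 64, 3)
R = np.random.rand(64, 64, 3)
B = screen_blur_blend(T, R)   # synthetic blended input
```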

4.
Article in English | MEDLINE | ID: mdl-37943649

ABSTRACT

With high temporal resolution, high dynamic range, and low latency, event cameras have enabled great progress in numerous low-level vision tasks. To help restore low-quality (LQ) video sequences, most existing event-based methods employ convolutional neural networks (CNNs) to extract sparse event features, without considering the spatially sparse distribution or the temporal relations among neighboring events. This leads to insufficient use of the spatial and temporal information carried by events. To address this problem, we propose a new spiking-convolutional network (SC-Net) architecture to facilitate event-driven video restoration. Specifically, to properly extract the rich temporal information contained in the event data, we utilize a spiking neural network (SNN), which suits the sparse characteristics of events and captures temporal correlation in neighboring regions; to make full use of the spatial consistency between events and frames, we adopt CNNs to transform sparse events into an extra brightness prior that is aware of detailed textures in video sequences. In this way, both the temporal correlation among neighboring events and the mutual spatial information between the two types of features are fully explored and exploited to accurately restore detailed textures and sharp edges. The effectiveness of the proposed network is validated on three representative video restoration tasks: deblurring, super-resolution, and deraining. Extensive experiments on synthetic and real-world benchmarks show that our method performs better than existing competing methods.
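To show how a spiking layer accumulates sparse temporal evidence, here is a toy leaky integrate-and-fire (LIF) pass over binned event frames; the threshold, decay, and shapes are assumptions and this is not the SC-Net implementation.

```python
# Toy LIF layer over a sequence of event frames (illustrative only).
import numpy as np

def lif_forward(event_frames, decay=0.8, threshold=1.0):
    # event_frames: (T, H, W) non-negative event counts per time bin
    membrane = np.zeros(event_frames.shape[1:], dtype=np.float32)
    spikes = []
    for frame in event_frames:
        membrane = decay * membrane + frame          # leaky integration of incoming events
        fired = (membrane >= threshold).astype(np.float32)
        membrane = membrane * (1.0 - fired)          # reset neurons that fired
        spikes.append(fired)
    return np.stack(spikes)                          # (T, H, W) binary spike trains

events = (np.random.rand(10, 32, 32) < 0.1).astype(np.float32)
spike_train = lif_forward(events)
```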

5.
Article in English | MEDLINE | ID: mdl-37527324

ABSTRACT

Canonical correlation analysis (CCA) is a correlation analysis technique that is widely used in statistics and the machine-learning community. However, the high complexity of its training process places a heavy burden on processing units and the memory system, making CCA nearly impractical for large-scale data. To overcome this issue, this article develops a novel CCA method that carries out the analysis in the Fourier domain. By applying the Fourier transform to the data, we convert the traditional eigenvector computation of CCA into finding predefined discriminative Fourier bases that can be learned with only element-wise dot products and sum operations, without complex, time-consuming calculations. As the eigenvalues come from sums of individual sample products, they can be estimated in parallel. Moreover, owing to the pattern repeatability of the data, the eigenvalues can be well estimated from partial samples. Accordingly, a progressive estimation scheme is proposed, in which the eigenvalues are estimated by feeding data batch by batch until the ordering of the eigenvalue sequence is stable. As a result, the proposed method is extraordinarily fast and memory efficient. Furthermore, we extend this idea to nonlinear kernel and deep models and obtain satisfactory accuracy with extremely fast training, as expected. An extensive discussion of the fast Fourier transform (FFT)-CCA is provided in terms of time and memory efficiency. Experimental results on several large-scale correlation datasets, such as MNIST8M, X-RAY MICROBEAM SPEECH, and Twitter Users Data, demonstrate the superiority of the proposed algorithm over state-of-the-art (SOTA) large-scale CCA methods: it achieves almost the same accuracy while training roughly 1000 times faster. This makes the proposed models a strong practical choice for dealing with large-scale correlation datasets. The source code is available at https://github.com/Mrxuzhao/FFTCCA.
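The progressive, element-wise estimation can be illustrated as follows: accumulate per-Fourier-basis cross and auto power batch by batch, score each basis by its correlation, and stop once the ranking stabilizes. This is only one hedged reading of the abstract, not the released FFTCCA code (see the authors' repository for that).

```python
# Illustrative progressive per-Fourier-basis correlation estimation (assumed reading).
import numpy as np

def progressive_fft_corr(batches_x, batches_y, patience=2):
    sxx = syy = sxy = None
    prev_order, stable = None, 0
    for bx, by in zip(batches_x, batches_y):             # bx, by: (batch, d)
        fx, fy = np.fft.rfft(bx, axis=1), np.fft.rfft(by, axis=1)
        sxy = (0 if sxy is None else sxy) + np.sum(fx * np.conj(fy), axis=0)
        sxx = (0 if sxx is None else sxx) + np.sum(np.abs(fx) ** 2, axis=0)
        syy = (0 if syy is None else syy) + np.sum(np.abs(fy) ** 2, axis=0)
        corr = np.abs(sxy) / np.sqrt(sxx * syy + 1e-12)  # correlation score per Fourier basis
        order = np.argsort(-corr)
        stable = stable + 1 if prev_order is not None and np.array_equal(order, prev_order) else 0
        prev_order = order
        if stable >= patience:                           # ranking unchanged for `patience` batches
            break
    return corr, order

rng = np.random.default_rng(0)
xs = [rng.standard_normal((256, 64)) for _ in range(20)]
ys = [x + 0.1 * rng.standard_normal(x.shape) for x in xs]
corr, top_bases = progressive_fft_corr(xs, ys)
```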

6.
Article in English | MEDLINE | ID: mdl-37467094

ABSTRACT

Audiovisual event localization aims to localize an event that is both visible and audible in a video. Previous works focus on segment-level audio and visual feature sequence encoding and neglect the event proposals and boundaries, which are crucial for this task. The event proposal features provide internal consistency across the consecutive segments that constitute one proposal, while the event boundary features make segments located at boundaries aware of the event occurrence. In this article, we explore proposal-level feature encoding and propose a novel context-aware proposal-boundary (CAPB) network to address audiovisual event localization. In particular, we design a local-global context encoder (LGCE) to aggregate local-global temporal context information for the visual sequence, the audio sequence, the event proposals, and the event boundaries, respectively. The local context from temporally adjacent segments or proposals contributes to event discrimination, while the global context from the entire video provides semantic guidance on temporal relationships. Furthermore, we enhance the structural consistency between segments by exploiting the above-encoded proposal and boundary representations. CAPB leverages the context information and structural consistency to obtain a context-aware, event-consistent cross-modal representation for accurate event localization. Extensive experiments conducted on the audiovisual event (AVE) dataset show that our approach outperforms the state-of-the-art methods by clear margins in both supervised event localization and cross-modality localization.

7.
Article in English | MEDLINE | ID: mdl-37220051

ABSTRACT

Reflections from glass are ubiquitous in daily life but usually undesirable in photographs. To remove these unwanted artifacts, existing methods utilize either correlative auxiliary information or handcrafted priors to constrain this ill-posed problem. However, due to their limited capability to describe the properties of reflections, these methods are unable to handle strong and complex reflection scenes. In this article, we propose a two-branch hue guidance network (HGNet) for single image reflection removal (SIRR) that integrates image information and the corresponding hue information. The complementarity between these two types of information has not been explored before. The key to this idea is our finding that hue information describes reflections well and can therefore serve as a superior constraint for the specific SIRR task. Accordingly, the first branch extracts salient reflection features by directly estimating the hue map. The second branch leverages these effective features, which help locate salient reflection regions, to obtain a high-quality restored image. Furthermore, we design a new cyclic hue loss to provide a more accurate optimization direction for network training. Experiments substantiate the superiority of our network, especially its excellent generalization ability to various reflection scenes, compared with state-of-the-art methods, both qualitatively and quantitatively. Source code is available at https://github.com/zhuyr97/HGRR.
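Because hue is periodic (0 and 1 denote the same color angle), a cyclic loss should measure the wrap-around distance. The sketch below shows that generic idea; the paper's cyclic hue loss may be formulated differently.

```python
# Illustrative cyclic hue distance that respects the wrap-around of hue values.
import torch

def cyclic_hue_loss(pred_hue, target_hue):
    # pred_hue, target_hue: tensors with values in [0, 1)
    diff = torch.abs(pred_hue - target_hue)
    return torch.minimum(diff, 1.0 - diff).mean()

p = torch.tensor([0.05, 0.95])
t = torch.tensor([0.95, 0.05])
print(cyclic_hue_loss(p, t))   # 0.10, not 0.90: wrap-around handled
```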

8.
IEEE Trans Pattern Anal Mach Intell ; 45(6): 7711-7725, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37015417

ABSTRACT

We study the problem of localizing audio-visual events that are both audible and visible in a video. Existing works focus on encoding and aligning audio and visual features at the segment level while neglecting informative correlation between segments of the two modalities and between multi-scale event proposals. We propose a novel Semantic and Relation Modulation Network (SRMN) to learn the above correlation and leverage it to modulate the related auditory, visual, and fused features. In particular, for semantic modulation, we propose intra-modal normalization and cross-modal normalization. The former modulates features of a single modality with the event-relevant semantic guidance of the same modality. The latter modulates features of two modalities by establishing and exploiting the cross-modal relationship. For relation modulation, we propose a multi-scale proposal modulating module and a multi-alignment segment modulating module to introduce multi-scale event proposals and enable dense matching between cross-modal segments, which strengthen correlations between successive segments within one proposal and between all segments. With the features modulated by the correlation information regarding audio-visual events, SRMN performs accurate event localization. Extensive experiments conducted on the public AVE dataset demonstrate that our method outperforms the state-of-the-art methods in both supervised event localization and cross-modality localization tasks.

9.
IEEE Trans Pattern Anal Mach Intell ; 45(8): 9534-9551, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37022385

ABSTRACT

Image deraining is a challenging task, since rain streaks have spatially long structures and complex diversity. Existing deep learning-based methods mainly construct deraining networks by stacking vanilla convolutional layers with local relations, and they can only handle a single dataset due to catastrophic forgetting, resulting in limited performance and insufficient adaptability. To address these issues, we propose a new image deraining framework that effectively explores nonlocal similarity and continuously learns on multiple datasets. Specifically, we first design a patchwise hypergraph convolutional module, which aims to better extract nonlocal properties with higher-order constraints on the data, to construct a new backbone and improve deraining performance. Then, to achieve better generalizability and adaptability in real-world scenarios, we propose a biologically brain-inspired continual learning algorithm. By imitating the plasticity mechanism of brain synapses during learning and memory, our continual learning process allows the network to achieve a subtle stability-plasticity tradeoff. Thus, it can effectively alleviate catastrophic forgetting and enables a single network to handle multiple datasets. Compared with the competitors, our new deraining network with unified parameters attains state-of-the-art performance on seen synthetic datasets and has significantly improved generalizability on unseen real rainy images.
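The abstract does not spell out the plasticity-inspired algorithm; as a generic stand-in, a synaptic-importance penalty (in the spirit of EWC, not the paper's method) anchors parameters that mattered for earlier datasets while leaving the rest plastic.

```python
# Generic synaptic-importance regularizer for continual learning (stand-in, assumed).
import torch

def importance_penalty(model, old_params, importance, strength=100.0):
    # old_params / importance: dicts of tensors keyed by parameter name
    loss = 0.0
    for name, p in model.named_parameters():
        loss = loss + (importance[name] * (p - old_params[name]) ** 2).sum()
    return strength * loss

net = torch.nn.Linear(4, 2)
old = {n: p.detach().clone() for n, p in net.named_parameters()}
fisher = {n: torch.ones_like(p) for n, p in net.named_parameters()}  # hypothetical importances
penalty = importance_penalty(net, old, fisher)
# total_loss = deraining_loss + penalty
```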


Subjects
Algorithms, Brain, Memory
10.
IEEE Trans Pattern Anal Mach Intell ; 45(3): 3003-3018, 2023 Mar.
Article in English | MEDLINE | ID: mdl-35759595

ABSTRACT

Weakly supervised Referring Expression Grounding (REG) aims to ground a particular target in an image described by a language expression, while lacking the correspondence between target and expression. Two main problems exist in weakly supervised REG. First, the lack of region-level annotations introduces ambiguities between proposals and queries. Second, most previous weakly supervised REG methods ignore the discriminative location and context of the referent, causing difficulties in distinguishing the target from other objects of the same category. To address these challenges, we design an entity-enhanced adaptive reconstruction network (EARN). Specifically, EARN includes three modules: entity enhancement, adaptive grounding, and collaborative reconstruction. In entity enhancement, we calculate semantic similarity as supervision to select the candidate proposals. Adaptive grounding calculates the ranking scores of candidate proposals based on subject, location, and context with hierarchical attention. Collaborative reconstruction measures the ranking result from three perspectives: adaptive reconstruction, language reconstruction, and attribute classification. The adaptive mechanism helps to alleviate the variance of different referring expressions. Experiments on five datasets show that EARN outperforms existing state-of-the-art methods. Qualitative results demonstrate that the proposed EARN can better handle situations where multiple objects of a particular category are situated together.

11.
IEEE Trans Pattern Anal Mach Intell ; 45(11): 12978-12995, 2023 Nov.
Article in English | MEDLINE | ID: mdl-35709118

ABSTRACT

Existing deep learning-based de-raining approaches have resorted to convolutional architectures. However, the intrinsic limitations of convolution, including local receptive fields and independence from input content, hinder the model's ability to capture long-range and complicated rainy artifacts. To overcome these limitations, we propose an effective and efficient transformer-based architecture for image de-raining. First, we introduce general priors of vision tasks, i.e., locality and hierarchy, into the network architecture so that our model can achieve excellent de-raining performance without costly pre-training. Second, since the geometric appearance of rainy artifacts is complicated and varies significantly in space, it is essential for de-raining models to extract both local and non-local features. Therefore, we design complementary window-based and spatial transformers to enhance locality while capturing long-range dependencies. Moreover, to compensate for the positional blindness of self-attention, we establish a separate representative space for modeling positional relationships and design a new relative position enhanced multi-head self-attention. In this way, our model enjoys powerful abilities to capture dependencies from both content and position, so as to achieve better image content recovery while removing rainy artifacts. Experiments substantiate that our approach attains more appealing results than state-of-the-art methods, both quantitatively and qualitatively.
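A simplified single-head window attention with a learnable relative-position bias added to the attention logits illustrates the general "position-aware attention" idea; the paper's relative position enhanced multi-head self-attention is more involved, so treat this as an assumed sketch.

```python
# Simplified window attention with a learnable relative-position bias (illustrative).
import torch
import torch.nn as nn

class RelPosWindowAttention(nn.Module):
    def __init__(self, dim: int, window: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.scale = dim ** -0.5
        # one learnable bias per relative offset in [-(window-1), window-1]
        self.rel_bias = nn.Parameter(torch.zeros(2 * window - 1))
        idx = torch.arange(window)
        self.register_buffer("rel_index", idx[None, :] - idx[:, None] + window - 1)

    def forward(self, x):                                   # x: (batch, window, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn + self.rel_bias[self.rel_index]         # inject relative positions
        return torch.softmax(attn, dim=-1) @ v

attn = RelPosWindowAttention(dim=32, window=8)
out = attn(torch.randn(2, 8, 32))                           # (2, 8, 32)
```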

12.
Article in English | MEDLINE | ID: mdl-36121960

ABSTRACT

The analysis of neuronal morphological data is essential to investigate the neuronal properties and brain mechanisms. The complex morphologies, absence of annotations, and sheer volume of these data pose significant challenges in neuronal morphological analysis, such as identifying neuron types and large-scale neuron retrieval, all of which require accurate measuring and efficient matching algorithms. Recently, many studies have been conducted to describe neuronal morphologies quantitatively using predefined measurements. However, hand-crafted features are usually inadequate for distinguishing fine-grained differences among massive neurons. In this article, we propose a novel morphology-aware contrastive graph neural network (MACGNN) for unsupervised neuronal morphological representation learning. To improve the retrieval efficiency in large-scale neuronal morphological datasets, we further propose Hash-MACGNN by introducing an improved deep hash algorithm to train the network end-to-end to learn binary hash representations of neurons. We conduct extensive experiments on the largest dataset, NeuroMorpho, which contains more than 100 000 neurons. The experimental results demonstrate the effectiveness and superiority of our MACGNN and Hash-MACGNN for large-scale neuronal morphological analysis.
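The binary-code idea behind Hash-MACGNN can be sketched with a generic deep-hash head: a tanh relaxation during training, sign() at retrieval time, and a quantization term that pushes codes toward plus or minus one. Layer sizes and loss weights below are assumptions, not the paper's settings.

```python
# Minimal deep-hash head (illustrative, not the Hash-MACGNN implementation).
import torch
import torch.nn as nn

class HashHead(nn.Module):
    def __init__(self, in_dim: int, code_bits: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, code_bits)

    def forward(self, embedding):
        relaxed = torch.tanh(self.proj(embedding))     # continuous codes in (-1, 1)
        quant_loss = (relaxed.abs() - 1.0).pow(2).mean()
        return relaxed, quant_loss

    @torch.no_grad()
    def binarize(self, embedding):
        return torch.sign(self.proj(embedding))        # binary codes for retrieval

head = HashHead(in_dim=256, code_bits=64)
codes, q_loss = head(torch.randn(8, 256))
binary = head.binarize(torch.randn(8, 256))
```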

13.
IEEE Trans Image Process ; 31: 3565-3577, 2022.
Article in English | MEDLINE | ID: mdl-35312620

ABSTRACT

TV show captioning aims to generate a linguistic sentence based on a video and its associated subtitle. Compared to purely video-based captioning, the subtitle can provide the captioning model with useful semantic clues such as actors' sentiments and intentions. However, effective use of the subtitle is also very challenging, because it consists of scrappy pieces of information and has a semantic gap with the visual modality. To organize the scrappy information and yield a powerful omni-representation over all modalities, an efficient captioning model requires understanding the video content, the subtitle semantics, and the relations in between. In this paper, we propose an Intra- and Inter-relation Embedding Transformer (I2Transformer), consisting of an Intra-relation Embedding Block (IAE) and an Inter-relation Embedding Block (IEE) under the framework of a Transformer. First, the IAE captures the intra-relation in each modality by constructing learnable graphs. Then, the IEE learns cross attention gates and selects useful information from each modality based on their inter-relations, so as to derive the omni-representation that is fed into the Transformer. Experimental results on the public dataset show that I2Transformer achieves state-of-the-art performance. We also evaluate the effectiveness of the IAE and IEE on two other relevant tasks of video with text inputs, i.e., TV show retrieval and video-guided machine translation. The encouraging performance further validates that the IAE and IEE blocks have good generalization ability. The code is available at https://github.com/tuyunbin/I2Transformer.


Subjects
Intention, Semantics
14.
IEEE Trans Image Process ; 31: 2726-2738, 2022.
Article in English | MEDLINE | ID: mdl-35324439

ABSTRACT

Video captioning aims to generate a natural language sentence that describes the main content of a video. Since there are multiple objects in videos, fully exploring the spatial and temporal relationships among them is crucial for this task. Previous methods wrap the detected objects into input sequences and leverage vanilla self-attention or graph neural networks to reason about visual relations. This cannot make full use of the spatial and temporal nature of a video, and it suffers from redundant connections, over-smoothing, and relation ambiguity. To address these problems, in this paper we construct a long short-term graph (LSTG) that simultaneously captures short-term spatial semantic relations and long-term transformation dependencies. Further, to perform relational reasoning over the LSTG, we design a global gated graph reasoning module (G3RM), which introduces global gating based on global context to control information propagation between objects and alleviate relation ambiguity. Finally, by introducing G3RM into the Transformer in place of self-attention, we propose the long short-term relation transformer (LSRT) to fully mine objects' relations for caption generation. Experiments on the MSVD and MSR-VTT datasets show that LSRT achieves superior performance compared with state-of-the-art methods. The visualization results indicate that our method alleviates the problem of over-smoothing and strengthens the ability of relational reasoning.
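One way to picture gated graph reasoning is a message-passing step in which a scalar gate computed from global context decides how much relational information flows between object nodes. The shapes and the exact gating form below are assumptions about G3RM, not its published definition.

```python
# Illustrative globally gated message-passing step (assumed form).
import torch
import torch.nn as nn

class GatedGraphStep(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, 1)

    def forward(self, nodes, adj):
        # nodes: (batch, n, dim); adj: (batch, n, n) row-normalized relation weights
        global_ctx = nodes.mean(dim=1)                          # (batch, dim)
        g = torch.sigmoid(self.gate(global_ctx)).unsqueeze(1)   # (batch, 1, 1)
        messages = adj @ self.msg(nodes)                        # aggregate neighbor info
        return nodes + g * messages                             # gate controls propagation

step = GatedGraphStep(dim=128)
x = torch.randn(2, 10, 128)
A = torch.softmax(torch.randn(2, 10, 10), dim=-1)
y = step(x, A)
```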

15.
IEEE Trans Pattern Anal Mach Intell ; 44(2): 710-722, 2022 Feb.
Article in English | MEDLINE | ID: mdl-30969916

ABSTRACT

With the maturity of visual detection techniques, we are more ambitious in describing visual content with open-vocabulary, fine-grained, and free-form language, i.e., the task of image captioning. In particular, we are interested in generating longer, richer, and more fine-grained sentences and paragraphs as image descriptions. Image captioning can be translated into the task of sequential language prediction given visual content, where the output sequence forms a natural language description with plausible grammar. However, existing image captioning methods focus only on the language policy and not the visual policy, and thus fail to capture the visual context that is crucial for compositional reasoning such as object relationships (e.g., "man riding horse") and visual comparisons (e.g., "small(er) cat"). This issue is especially severe when generating longer sequences such as a paragraph. To fill the gap, we propose a Context-Aware Visual Policy network (CAVP) for fine-grained image-to-language generation: image sentence captioning and image paragraph captioning. During captioning, CAVP explicitly considers the previous visual attentions as context, and decides whether the context is used for the current word/sentence generation given the current visual attention. Compared with the traditional visual attention mechanism, which only fixes a single visual region at each step, CAVP can attend to complex visual compositions over time. The whole image captioning model, consisting of CAVP and its subsequent language policy network, can be efficiently optimized end-to-end using an actor-critic policy gradient method. We demonstrate the effectiveness of CAVP with state-of-the-art performance on the MS-COCO and Stanford captioning datasets, using various metrics and sensible visualizations of qualitative visual context.


Subjects
Algorithms, Policies, Animals, Horses, Humans
16.
IEEE Trans Neural Netw Learn Syst ; 33(11): 6802-6816, 2022 Nov.
Article in English | MEDLINE | ID: mdl-34081590

ABSTRACT

Deep learning-based methods have achieved notable progress in removing the blocking artifacts caused by lossy JPEG compression. However, most deep learning-based methods handle this task by designing black-box network architectures that directly learn the relationships between compressed images and their clean versions. These architectures lack sufficient interpretability, which limits further improvements in deblocking performance. To address this issue, in this article we propose a model-driven deep unfolding method for JPEG artifact removal with interpretable network structures. First, we build a maximum a posteriori (MAP) model for deblocking using convolutional dictionary learning and design an iterative optimization algorithm based on proximal operators. Second, we unfold this iterative algorithm into a learnable deep network structure, where each module corresponds to a specific operation of the iterative algorithm. In this way, our network inherits both the powerful modeling capacity of data-driven deep learning methods and the interpretability of traditional model-driven methods. By training the proposed network end-to-end, all learnable modules can be automatically explored to well characterize the representations of both JPEG artifacts and image content. Experiments on synthetic and real-world datasets show that our method generates competitive or even better deblocking results compared with state-of-the-art methods, both quantitatively and qualitatively.
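The general deep-unfolding recipe can be seen in one proximal-gradient (ISTA-style) iteration with a soft-threshold operator; unfolding networks turn each such iteration into a learnable layer. The paper unfolds a convolutional-dictionary MAP model, so this dense version is only a sketch of the recipe, not that model.

```python
# One ISTA-style iteration: gradient step on a least-squares term, then soft-thresholding.
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ista_step(z, D, y, step=0.1, lam=0.05):
    # minimize 0.5*||D z - y||^2 + lam*||z||_1
    grad = D.T @ (D @ z - y)
    return soft_threshold(z - step * grad, step * lam)

rng = np.random.default_rng(1)
D = rng.standard_normal((64, 128))
y = rng.standard_normal(64)
z = np.zeros(128)
for _ in range(50):          # in a deep-unfolded network, each loop iteration becomes a layer
    z = ista_step(z, D, y)
```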

17.
IEEE Trans Med Imaging ; 40(11): 3205-3216, 2021 Nov.
Article in English | MEDLINE | ID: mdl-33999814

ABSTRACT

Manually labeling neurons from high-resolution but noisy and low-contrast optical microscopy (OM) images is tedious. As a result, the lack of annotated data poses a key challenge when applying deep learning techniques for reconstructing neurons from noisy and low-contrast OM images. While traditional tracing methods provide a possible way to efficiently generate labels for supervised network training, the generated pseudo-labels contain many noisy and incorrect labels, which lead to severe performance degradation. On the other hand, the publicly available dataset, BigNeuron, provides a large number of single 3D neurons that are reconstructed using various imaging paradigms and tracing methods. Though the raw OM images are not fully available for these neurons, they convey essential morphological priors for complex 3D neuron structures. In this paper, we propose a new approach to exploit morphological priors from neurons that have been reconstructed for training a deep neural network to extract neuron signals from OM images. We integrate a deep segmentation network in a generative adversarial network (GAN), expecting the segmentation network to be weakly supervised by pseudo-labels at the pixel level while utilizing the supervision of previously reconstructed neurons at the morphology level. In our morphological-prior-guided neuron reconstruction GAN, named MP-NRGAN, the segmentation network extracts neuron signals from raw images, and the discriminator network encourages the extracted neurons to follow the morphology distribution of reconstructed neurons. Comprehensive experiments on the public VISoR-40 dataset and BigNeuron dataset demonstrate that our proposed MP-NRGAN outperforms state-of-the-art approaches with less training effort.


Subjects
Image Processing, Computer-Assisted; Microscopy; Neural Networks, Computer; Neurons
18.
IEEE Trans Neural Netw Learn Syst ; 32(12): 5445-5455, 2021 Dec.
Article in English | MEDLINE | ID: mdl-33667168

ABSTRACT

Learning to adapt to a series of different goals in visual navigation is challenging. In this work, we present a model-embedded actor-critic architecture for the multigoal visual navigation task. To enhance task cooperation in multigoal learning, we introduce two new designs into the reinforcement learning scheme: an inverse dynamics model (InvDM) and multigoal colearning (MgCl). Specifically, InvDM is proposed to capture the navigation-relevant association between state and goal and to provide additional training signals that relieve the sparse reward issue. MgCl aims at improving sample efficiency and enables the agent to learn from unintentional positive experiences. Moreover, to further improve the scene generalization capability of the agent, we present an enhanced navigation model that consists of two self-supervised auxiliary task modules. The first module, path closed-loop detection, helps to determine whether the current state has been experienced before. The second, the state-target matching module, tries to figure out the difference between state and goal. Extensive results on the interactive platform AI2-THOR demonstrate that the agent trained with the proposed method converges faster than state-of-the-art methods while exhibiting good generalization capability. A video demonstration is available at https://vsislab.github.io/mgvn.
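For orientation, the textbook form of an inverse dynamics auxiliary task predicts the action that connects two consecutive observations; the paper's InvDM instead couples state and goal, which is not shown here, so the inputs and sizes below are assumptions.

```python
# Generic inverse dynamics head used as an auxiliary task (illustrative stand-in).
import torch
import torch.nn as nn

class InverseDynamics(nn.Module):
    def __init__(self, obs_dim: int, num_actions: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, obs_t, obs_t1):
        return self.net(torch.cat([obs_t, obs_t1], dim=-1))   # logits over actions

inv = InverseDynamics(obs_dim=128, num_actions=4)
logits = inv(torch.randn(8, 128), torch.randn(8, 128))
aux_loss = nn.functional.cross_entropy(logits, torch.randint(0, 4, (8,)))
```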

19.
Neural Netw ; 139: 77-85, 2021 Jul.
Article in English | MEDLINE | ID: mdl-33684611

ABSTRACT

Deep Convolutional Neural Networks (CNNs), such as the Dense Convolutional Network (DenseNet), have achieved great success in image representation learning by capturing deep hierarchical features. However, most existing network architectures that simply stack convolutional layers fail to fully discover local and global feature information between layers. In this paper, we investigate how to enhance the local and global feature learning abilities of DenseNet by fully exploiting the hierarchical features from all convolutional layers. Technically, we propose an effective convolutional deep model termed the Dense Residual Network (DRN) for the task of optical character recognition. To define DRN, we propose a refined residual dense block (r-RDB) that retains the local feature fusion and local residual learning abilities of the original RDB while reducing the computational cost of its inner layers. After fully capturing local residual dense features, we utilize a sum operation and several r-RDBs to construct a new block termed the global dense block (GDB), which, by imitating the construction of dense blocks, adaptively learns global dense residual features in a holistic way. Finally, we use two convolutional layers to design a down-sampling block that reduces the global feature size and extracts deeper, more informative features. Extensive results show that our DRN delivers enhanced results compared with other related deep models.
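For reference, a plain residual dense block combines densely connected 3x3 convolutions, a 1x1 local feature fusion, and a local residual connection, which is the structure the r-RDB refines; the refinements and the GDB are not reproduced in this sketch.

```python
# Plain residual dense block (RDB): dense connections + local fusion + local residual.
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    def __init__(self, channels: int, growth: int = 32, layers: int = 4):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(channels + i * growth, growth, 3, padding=1) for i in range(layers)
        )
        self.fuse = nn.Conv2d(channels + layers * growth, channels, 1)  # local feature fusion

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(torch.relu(conv(torch.cat(feats, dim=1))))     # dense connections
        return x + self.fuse(torch.cat(feats, dim=1))                   # local residual learning

block = ResidualDenseBlock(channels=64)
out = block(torch.randn(1, 64, 48, 48))   # same shape as the input
```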


Subjects
Deep Learning; Neural Networks, Computer; Pattern Recognition, Automated/methods
20.
IEEE Trans Neural Netw Learn Syst ; 32(2): 722-735, 2021 Feb.
Article in English | MEDLINE | ID: mdl-32275611

ABSTRACT

Person re-identification (re-ID) favors discriminative representations over unseen shots to recognize identities in disjoint camera views. Effective methods are developed via pair-wise similarity learning to detect a fixed set of region features, which can be mapped to compute a similarity value. However, relevant parts of each image are detected independently, without reference to the correlation with the other image. Moreover, region-based methods spatially position local features for their aligned similarities. In this article, we introduce the deep coattention-based comparator (DCC) to fuse codependent representations of paired images so as to correlate the most relevant parts and produce their relative representations accordingly. The proposed approach mimics human foveation to detect distinct regions concurrently across images and alternately attends to them to fuse them into the similarity learning. Our comparator is capable of learning representations relative to a test shot and is well-suited to re-identifying pedestrians in surveillance. We perform extensive experiments to provide insights and demonstrate the state-of-the-art results achieved by our method on benchmark datasets: 1.2 and 2.5 points gain in mean average precision (mAP) on DukeMTMC-reID and Market-1501, respectively.
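A bare-bones co-attention between the region features of two images shows the generic mechanism the abstract alludes to: an affinity matrix couples the two sets, and each image attends to the regions of the other. This is not the DCC architecture itself; the bilinear form and dimensions are assumptions.

```python
# Minimal bilinear co-attention between two sets of region features (illustrative).
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.bilinear = nn.Parameter(torch.randn(dim, dim) * dim ** -0.5)

    def forward(self, feat_a, feat_b):
        # feat_a: (batch, na, dim); feat_b: (batch, nb, dim)
        affinity = feat_a @ self.bilinear @ feat_b.transpose(1, 2)   # (batch, na, nb)
        a_attends_b = torch.softmax(affinity, dim=-1) @ feat_b       # regions of B relevant to A
        b_attends_a = torch.softmax(affinity, dim=1).transpose(1, 2) @ feat_a
        return a_attends_b, b_attends_a

coatt = CoAttention(dim=128)
fa, fb = torch.randn(2, 6, 128), torch.randn(2, 8, 128)
ca, cb = coatt(fa, fb)   # (2, 6, 128), (2, 8, 128)
```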


Subjects
Attention; Automated Facial Recognition; Deep Learning; Image Processing, Computer-Assisted/methods; Algorithms; Artificial Intelligence; Benchmarking; Biometric Identification; Databases, Factual; Humans; Neural Networks, Computer; Reproducibility of Results; Software